Properties of the Least Squares Temporal Difference learning algorithm
Authors
Abstract
This paper focuses on policy evaluation using the well-known Least Squares Temporal Differences (LSTD) algorithm. We give several alternative ways of looking at the algorithm: the operator-theoretic approach via the Galerkin method, the statistical approach via instrumental variables, as well as the limit of the TD iteration. Further, we give a geometric view of the algorithm as an oblique projection. Moreover, we compare the optimization problem solved by LSTD with the one solved by Bellman Residual Minimization (BRM). We also treat the modification of LSTD for the case of episodic Markov Reward Processes.

The main practical problem that the LSTD algorithm solves is the following: we are given a feed of data from a stochastic system, consisting of a state description in terms of features and of rewards. The task is to construct an abstraction that maps from states to values of states, where the value is defined as the discounted sum of future rewards. We will show that for LSTD, this abstraction is a linear model. For example, the system may describe a chess game, the features of a state may describe which pieces the players have, and the reward signal corresponds to either winning or losing the game. The value signal will then correspond to the value of having each particular piece. Note that this is not a universal constant but may depend on the way the individual players play the game; for example, the values may be different for humans than for computer players.

It is well known that the value function of a given policy can be expressed as V = (I − γP)^{-1} R. The LSTD algorithm can be thought of as a way of computing this value function approximately. The motivation for why the approximation is often necessary is threefold. First, we may not have access to the states directly, just to functions φ of the state. Second, the number of states n is often computationally intractable. Third, even if n is tractable, there is the problem of statistical tractability: the number of samples needed to accurately estimate an n × n transition matrix is often completely prohibitive.

Associated with our problem setting is the question whether the value function is interesting in its own right, or whether we only need it to adjust the future behaviour of some aspect of the environment we can control (in our chess example, to make a move). We believe that there is a large class of systems (for instance expert systems) where the focus will be on gaining insight into the behaviour of the stochastic system, but the decisions about whether or how to act will still be made manually by human controllers, on the basis of the value-function information. These are the cases where algorithms like LSTD are the most directly applicable. On the other side of the spectrum, there will of course also be situations where the value-function estimate is used as a tool to automatically generate the best action on the part of the agent; such systems may also use value-function estimation algorithms like LSTD within the policy iteration framework.
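To make the estimation problem described above concrete, the following is a minimal sketch of the batch LSTD(0) computation, assuming we are handed arrays of state features φ(s_t), rewards r_t, and successor features φ(s_{t+1}). The function name `lstd0`, the small ridge term, and the toy two-state chain are illustrative choices, not part of the paper.

```python
import numpy as np

def lstd0(phi, phi_next, rewards, gamma=0.95, ridge=1e-6):
    """Batch LSTD(0) estimate of linear value-function weights theta.

    phi      : (T, k) features of the visited states s_t
    phi_next : (T, k) features of the successor states s_{t+1}
    rewards  : (T,)   observed rewards r_t

    The estimate solves A theta = b with
    A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s_{t+1}))^T and b = sum_t phi(s_t) r_t.
    The ridge term is only a numerical safeguard for short sample runs.
    """
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A + ridge * np.eye(A.shape[0]), b)

if __name__ == "__main__":
    # Toy two-state Markov Reward Process with one-hot (tabular) features,
    # so the LSTD estimate should approach the exact V = (I - gamma P)^{-1} R.
    rng = np.random.default_rng(0)
    P = np.array([[0.9, 0.1], [0.2, 0.8]])  # transition matrix
    R = np.array([0.0, 1.0])                # reward received in each state
    gamma = 0.9
    states = [0]
    for _ in range(5000):
        states.append(rng.choice(2, p=P[states[-1]]))
    s, s_next = np.array(states[:-1]), np.array(states[1:])
    theta = lstd0(np.eye(2)[s], np.eye(2)[s_next], R[s], gamma)
    print("LSTD estimate:", theta)
    print("Exact value:  ", np.linalg.solve(np.eye(2) - gamma * P, R))
```

With tabular features the linear model is exact, so the estimate approaches the true value function; with fewer features than states it instead approaches the TD fixed point discussed in the prior-work overview below.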
1 Prior Work on LSTD

An exhaustive introduction to least-squares methods for Reinforcement Learning is provided in chapter 6 of Bertsekas' monograph [6]. The LSTD algorithm was introduced in the paper by Bradtke and Barto [8]. Boyan later extended it to the case with eligibility traces [7], where an additional parameter λ controls how far back the updates are influenced by previous states. The connection between LSTD and LSPE, as well as a clean-cut proof that the on-line version of LSTD converges, was given by Nedić and Bertsekas [21]. The seminal paper [28] by Tsitsiklis and Van Roy provided an explicit connection between the fixed point of the iterative TD algorithm and the LSTD solution, while also formally proving that the TD algorithm for policy evaluation converges. The paper [3] described the Bellman Residual Minimization procedure as an alternative to TD. Antos' paper [1] provided an extensive comparison of the similarities and differences between LSTD and Bellman Residual Minimization (BRM). Parr's paper [16] introduced the LSPI algorithm as a principled way to combine LSTD with control. The paper by Munos [20] introduced bounds for policy iteration with linear function approximation, albeit under strong assumptions. Scherrer [24] provided the geometric interpretation of LSTD as an oblique projection, in the context of analysing the differences between LSTD and BRM. The paper [14] represents an early approach to automatically constructing features for RL algorithms, including LSTD. Schoknecht [25] gave an interpretation of LSTD and other algorithms in terms of a projection with respect to a certain inner product. Choi and Van Roy [9] discuss the similarities between LSTD and a version of the Kalman filter. There exist various approaches in the literature to how LSTD can be regularized, none of which can be conclusively claimed to outperform the others. These include the L1 approaches of [18] and [15] and the nested approach of [13]. These approaches differ not just in what regularization term is used; they solve different optimization problems (we will discuss this in section 5).
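For orientation, the distinction between the optimization problems solved by LSTD and by BRM is usually written as below. This is a sketch in standard notation; the paper's own symbols in section 5 may differ.

```latex
% Notation (standard, assumed here):
%   \Phi                   feature matrix, so V_\theta = \Phi\theta is the linear approximation
%   T^{\pi}                Bellman operator of the evaluated policy, T^{\pi} V = R + \gamma P V
%   \Pi                    projection onto \operatorname{span}(\Phi), weighted by the stationary distribution
%   \lVert\cdot\rVert_{D}  the corresponding weighted norm

% LSTD computes the fixed point of the projected Bellman operator:
\Phi\,\theta_{\mathrm{LSTD}} \;=\; \Pi\, T^{\pi}\!\bigl(\Phi\,\theta_{\mathrm{LSTD}}\bigr)

% BRM instead minimizes the Bellman residual itself:
\theta_{\mathrm{BRM}} \;=\; \arg\min_{\theta}\; \bigl\lVert \Phi\theta - T^{\pi}(\Phi\theta) \bigr\rVert_{D}^{2}
```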
Similar Resources
An Analysis of Temporal Difference Learning with Function Approximation
We discuss the temporal difference learning algorithm as applied to approximating the cost-to-go function of an infinite horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence with probabili...
Least-Squares Temporal Difference Learning
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squ...
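For contrast with the batch LSTD sketch given earlier, the incremental update referred to in this abstract can be sketched as follows for the linear, λ = 0 case; the function name and the stepsize value are illustrative, not taken from the cited paper.

```python
import numpy as np

def td0_update(theta, phi_s, reward, phi_s_next, gamma=0.95, alpha=0.01):
    """One incremental TD(0) update of linear value-function weights.

    Each observed transition is used once, and the quality of the estimate
    depends on the hand-tuned stepsize alpha -- the two drawbacks relative
    to LSTD mentioned in the abstract above.
    """
    td_error = reward + gamma * (phi_s_next @ theta) - phi_s @ theta
    return theta + alpha * td_error * phi_s
```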
Analytical Mean Squared Error Curves in Temporal Difference Learning
We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its steps...
Evolutionary Algorithms for Reinforcement Learning
There are two distinct approaches to solving reinforcement learning problems, namely, searching in value function space and searching in policy space. Temporal difference methods and evolutionary algorithms are well-known examples of these approaches. Kaelbling, Littman and Moore recently provided an informative survey of temporal difference methods. This article focuses on the application of evo...
Least-squares temporal difference learning based on extreme learning machine
This paper proposes a least-squares temporal difference (LSTD) algorithm based on extreme learning machine that uses a single hidden layer feedforward network to approximate the value function. While LSTD is typically combined with local function approximators, the proposed approach uses a global approximator that allows better scalability properties. The results of the experiments carried out o...